Human microbiome variation associated with race and ethnicity emerges as early as 3 months of age

Human microbiome variation is linked to the incidence, prevalence, and mortality of many diseases and associates with race and ethnicity in the United States. However, the age at which microbiome variability emerges between these groups remains a central gap in knowledge. Here, we identify that gut microbiome variation associated with race and ethnicity arises after 3 months of age and persists through childhood. One-third of the bacterial taxa that vary across caregiver-identified racial categories in children are taxa reported to also vary between adults. Machine learning modeling of childhood microbiomes from 8 cohort studies (2,756 samples from 729 children) distinguishes racial and ethnic categories with 87% accuracy. Importantly, predictive genera are also among the top 30 most important taxa when childhood microbiomes are used to predict adult self-identified race and ethnicity. Our results highlight a critical developmental window at or shortly after 3 months of age when social and environmental factors drive race and ethnicity-associated microbiome variation and may contribute to adult health and health disparities.


Introduction
Two major goals of the human microbiome sciences include increasing the representation of undersampled groups in microbiome datasets [1][2][3] and understanding the tempo by which inequitable experiences, intergenerational inequality, and structural racism impact microbiome variation and health outcomes [4][5][6][7][8]. Early-life social and environmental exposures can have large and lasting effects on child development and adult health, and perturbations to the gut microbiome may be important to future disease risk [9][10][11][12][13][14][15][16][17][18][19]. In the United States, adult gut microbiome diversity correlates with self-identified race and ethnicity [1,3]. However, socioeconomic status (SES), measured as neighborhood deprivation index, individual and parental education, or household income, is both correlated with adult gut microbiome diversity and associated with race and ethnicity [20][21][22][23][24]. We emphasize that race and ethnicity are proxies for inequitable exposure to social and environmental determinants of health due to structural racism [6][7][8][25,26]. When human microbiome differences arise during development and whether or not distinguishing gut taxa overlap between childhood and adulthood are key questions that have implications for long-term effects of early life experiences, including structural racism, on microbiome variation.
To identify the developmental window when microbiome variation emerges, how long it persists during childhood, and which distinguishing taxa overlap between children and adults, we combined 8 gut microbiome composition datasets from 2,756 samples spanning 729 children between birth and 12 years of age throughout the US (S1 Table). We used caregiver-identified race (Asian/Pacific Islander, Black, White) and ethnicity (Hispanic, non-Hispanic) to capture complex interactions of multiple biosocial factors that influence gut microbiome composition, even though race and ethnicity are not biological categories that directly influence microbiome variation [5][6][7][26]. We used a diverse dataset of childhood microbiome samples to identify features of the gut microbiome that are potential markers of the inequitable experiences underlying health disparities. We selected samples from multiple 16S rRNA gene sequencing studies that represent a higher diversity of children than is commonly present in large analyses of the gut microbiome [1][2][3]. In the present study, 17.2% of samples were from non-White individuals, and 14.3% of samples were from Hispanic individuals. While the majority of samples from Hispanic individuals are from Hispanic White children, some Hispanic Black children are present in the dataset.

Microbiome variation emerges at or shortly after 3 months of age
Subject explained the greatest proportion of variation, consistent with other studies of the gut microbiome (S1 Fig). As age had the second strongest association with gut microbiome composition of the variables tested (Figs 1 and S1-S9 and S2-S4 Tables), we stratified samples by age and analyzed each age category separately while controlling for study differences to disentangle when in development race and ethnicity-associated microbiome variation originates. Delivery route and infant diet were not included in the age-stratified analysis, as they covaried with race and ethnicity (S10 and S11 Figs and S5 Table).
Notably, race and ethnicity did not significantly vary with gut microbiome alpha diversity (within-individual diversity) or beta diversity (between-individual diversity) in the early weeks and months of life, including the first week, 1 to 5.9 weeks, and 6 weeks to 2.9 months (permutational multivariate analysis of variance (PERMANOVA), all p > 0.05) (Figs 2, S2, S12, and S13 and S2 Table). However, at 3 to 11.9 and 12 to 35.9 months, gut microbiome composition based on UniFrac distances varied slightly but significantly by both race and ethnicity (PERMANOVA, all p < 0.05) (Figs 2B, S2, S12, and S13 and S2 Table). Additionally, most measures of alpha diversity varied across racial categories at 3 to 11.9 months and across both racial and ethnic categories at 12 to 35.9 months (linear mixed effects models (LME), p < 0.05) (Fig 2A and S4 Table). Pairwise comparisons confirmed that Black individuals had higher within-sample diversity than White individuals at 3 to 11.9 and 12 to 35.9 months for at least one of the 5 measures of diversity (Fig 2A and S4 Table) [27]. While higher alpha diversity is consistently associated with better cardiometabolic health and lower incidence of inflammatory disease in adults [28][29][30], studies have found mixed results in children. For example, studies of associations between alpha diversity and risk of allergic disease have found negative [31], positive [32], and no [33] association. From 3 to 11.9 years, race associated with gut microbiome composition using only unweighted UniFrac distances (PERMANOVA, all p < 0.05) (S12 and S13 Figs and S2 Table). Collectively, these results reveal that race and ethnicity associate with microbial diversity after 3 months of age, and, notably, this variation persists through childhood years.
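To make the within-sample diversity measures concrete, the following minimal sketch computes four of the five alpha diversity metrics from a single sample's ASV counts. It is an illustration under stated assumptions, not the pipeline used in this study: the function names are hypothetical, the natural logarithm is assumed for Shannon diversity (the log base is a convention that varies between tools), Chao 1 is shown in its bias-corrected form, and Faith's PD is omitted because it requires a phylogenetic tree.

```python
import math

def observed_features(counts):
    """Richness: number of ASVs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)); natural log is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S); set to 0.0 here when S <= 1."""
    s = observed_features(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

def chao1(counts):
    """Bias-corrected Chao 1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton ASVs."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed_features(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical ASV counts for one sample (illustrative values only).
sample = [120, 30, 5, 2, 1, 1, 0]
for name, fn in [("observed ASVs", observed_features), ("Shannon", shannon),
                 ("Pielou", pielou_evenness), ("Chao 1", chao1)]:
    print(name, round(fn(sample), 3))
```

Richness-based metrics (observed ASVs, Chao 1) and evenness-based metrics (Pielou) can move independently, which is one reason a group difference may appear in some of the 5 measures and not others.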

Child gut microbiome variation recapitulates that of adults
To identify differentially abundant taxa, we used analysis of compositions of microbiomes with bias correction (ANCOM-BC) for each variable of interest across all age categories. Age was included as a factor in the models, and numerous taxa were differentially abundant across age categories (S6-S9 Tables). The abundances of several taxa were significantly associated with race and/or ethnicity in all samples combined (S5-S9 Tables), including several that varied in abundance between age categories (S14 and S15 Figs). Taxa positively associated with breastfeeding (Bifidobacterium, Lactobacillus, and Staphylococcus) [34,35] were significantly negatively correlated with age, as expected (S14 and S15 Figs and S9 Table). These taxa were differentially abundant between racial or ethnic categories, likely due to differences in rates of breastfeeding across these groups (S10 and S11 Figs and S5 Table). Delivery route also differed between racial and ethnic categories: vaginal delivery was more likely than expected in White, Asian/Pacific Islander, and non-Hispanic children and less likely than expected in Black and Hispanic children (S10 and S11 Figs and S5 Table). However, some individual species within Bacteroides, which is often more abundant in vaginally delivered children [34,35], were more enriched in Black and Hispanic children (S9 Table), contrary to our expectations.
Notably, there was moderate overlap between studies for differentially abundant taxa (S10 Table). Of the 57 gut microbial taxa that varied in abundance between children of differing self-identified racial categories, 19 were previously identified as differentially abundant between Black and White adult individuals in a recent controlled study of gut microbiome variation [3] (Fig 3A and S9 Table). Four of the 19 overlapping taxa were higher in abundance in both Black children and adults compared with White children and adults, and 4 of the
Fig 2 (caption fragment). (A) Comparisons with whiskers that do not cross zero indicate a significant difference in alpha diversity between those 2 categories. Colors in the dot whisker plots denote alpha diversity metric, and dot shape and line type denote age category. (B) NMDS plots show weighted UniFrac distances by race and ethnicity at 0-2.9 months, 3-11.9 months, and 12-35.9 months. Colors and 95% confidence ellipses in the NMDS plots denote race, and shape denotes ethnicity. Blue text in the panels highlights significant p-values. NMDS plots for additional age categories and unweighted UniFrac distances can be found in the Supporting information (S12 and S13 Figs). Data underlying this figure can be found in S1, S2, and S4 Data.
To detect differentially abundant taxa within each age category, we used generalized linear mixed models with a negative binomial distribution (ANCOM-BC requires more samples per group than we had within each age category). However, few taxa were identified as differentially abundant within each age category (S6-S9 Tables). No phyla or families were differentially abundant between racial and ethnic categories within any age category, and only one genus differed between White and Asian/Pacific Islander children (S6-S9 Tables). Of the 6 species that differed in abundance between racial categories and 4 species that differed in abundance between ethnic categories, none were found in more than one age group (S9 Table). Coprococcus, one of the differentially abundant taxa within a specific age group (12 to 35.9 months), was more abundant in non-Hispanic children and has been previously associated both with obesity and a high-fiber diet [43]. The other differentially abundant taxa within specific age groups did not have clear links to health-related outcomes in the literature. Overall, taxa with age-associated variation did not systematically vary by race or ethnicity.
We next used a machine learning approach to identify additional characteristics of the microbiome that may be markers of inequitable exposure to social and environmental determinants of health. A random forest classifier based on the abundance of genera spanning all childhood samples distinguished Black versus White versus Asian/Pacific Islander categories and Hispanic versus non-Hispanic categories with 87% accuracy. Notably, 13 amplicon sequence variants (ASVs) among the top 30 most important genera that increased classification accuracy in the model (S16 and S17 Figs and S11 Table) are taxa identified as differentially abundant between self-identified racial categories in both children in the current study and adults in previous work [3] (Fig 3B and S9 Table). For race, we used a 3-part model, and model performance estimated as area under the curve (AUC; values above 0.5 indicate the classifier is performing better than chance) was 0.914 (Fig 3B). For ethnicity, we used a binary model, and AUC was 0.886 (Fig 3B).
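As an aid to interpreting the AUC values reported above, the sketch below computes AUC directly from its rank-based definition: the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative sample, with ties counted as one half. The function names are hypothetical, and the unweighted one-vs-rest average shown for the 3-class case is an assumption made for illustration; the text does not state how the multiclass AUC was aggregated.

```python
def roc_auc(labels, scores):
    """Binary AUC: fraction of (positive, negative) pairs in which the
    positive sample has the higher score; tied pairs count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(labels, class_scores):
    """Unweighted mean of one-vs-rest AUCs (an assumed aggregation).
    class_scores maps each class name to per-sample scores for that class."""
    aucs = [roc_auc([1 if y == c else 0 for y in labels], scores)
            for c, scores in class_scores.items()]
    return sum(aucs) / len(aucs)

# Hypothetical binary example: 1 = Hispanic, 0 = non-Hispanic.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.3, 0.6, 0.1]
print(roc_auc(y_true, y_score))
```

Under this definition, a classifier that assigns the same score to every sample yields an AUC of exactly 0.5, which is why 0.5 serves as the chance baseline.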
Additionally, we used the childhood microbiome data in a random forest model to assess if childhood microbiome variation predicts that of healthy adults in the American Gut Project (AGP) dataset. As expected, compositional data from children did not reliably distinguish adults of differing racial categories (S18 and S19 Figs), with an AUC of 0.570. Twenty-six of the top 30 taxa identified as important microbiome characteristics in the model using data from children to predict adult metadata were also identified as important taxa in the random forest model that only used data from children (S16 and S19 Figs). However, the taxa with the highest importance differed with respect to the magnitude and direction of the differences between adults and children (S20 Fig).
Specifically, Enterobacteriaceae and Prevotella are highly important in child-child models but are of modest importance in child-adult models (S16 and S19 Figs), and their relative abundances are lowest in White children but highest in White adults (S20 Fig). Other studies have similarly found that specific taxa can be used to differentiate the gut microbiome of groups of people but that the direction of effect can differ between adults and children. Prevotella was highly important in both adult and child random forest models used to detect taxa that distinguish the gut microbiome across geographic regions, but the direction of the differences in relative abundance differed [44]. In children, Prevotella was more abundant in the US, but Prevotella was more abundant in adults outside of the US [44]. Alistipes was found to be protective against irritable bowel syndrome (IBS) in adults, but predictive of IBS in children [45]. In contrast, other taxa have a similar direction of effect in both children and adults. Ruminococcus is specifically important in the child-adult models, likely due to similar variation in abundance between racial categories in both children and adults (S20 Fig). Higher abundances of Ruminococcus are linked with an increased risk of colorectal cancer [46], a disease for which there is a known racial health disparity [47,48]; however, we find that Ruminococcus is most abundant in White individuals, a group whose colorectal cancer risk is lower than that of Black individuals but higher than that of Asian/Pacific Islander individuals. Race-associated variation in the relative abundance of Ruminococcus across adult guts is not universal, is likely due to a subset of Ruminococcus species, and may interact with other factors such as stress or BMI [1,49]. Thus, it is difficult to know how or if the differences observed in the microbiome here contribute directly to health disparities.

Discussion
Race and ethnicity associate with gut microbiome composition and diversity beginning at 3 months of age, indicative of a narrow window of time (at or shortly after 3 months) and tempo when this variation emerges. Specifically, we found that both race and ethnicity account for small but statistically significant proportions of the variation in gut microbiome composition; that multiple taxa were differentially abundant between self-reported racial and ethnic categories, several of which were previously identified as differentially abundant in adults [3]; and that a random forest classifier reliably distinguishes caregiver-identified race and ethnicity. Notably, our findings do not support race- or ethnicity-associated variation appearing at birth or shortly after, when mother-to-infant and other mechanisms of vertical microbial transmission are expected to be strongest [50,51]. None of the differentially abundant taxa identified in the current study are known to be vaginally acquired by infants, and only 2 species are known to be vertically transmitted from the mother [51]. Instead, external factors are most likely shaping race- and ethnicity-associated microbiome variation at or shortly after 3 months. Our results highlight the impetus to increase the diversity of individuals included in studies in the microbiome sciences [1][2][3] and support the call for studies investigating how structural racism and other structural inequities affect microbiome variation and health [4][5][6][7].
The race- and ethnicity-associated differences in the gut microbiome likely reflect differences in environmental and social factors [6][7][8][25,26]. In the US, there are clear racial and ethnic disparities in health that are tied to differences in these same factors: psychosocial stressors, socioeconomic differences, culture, diet and access to food, access to healthcare and education, interactions with the built environment, and environmental pollutants [6,25,49,52,53]. These factors are important social and environmental determinants of health that have tangible impacts through the modification of human physiology [52,53]. In addition, there is evidence that the developmental trajectory of the gut microbiome is associated with immune system development, metabolic programming, antibiotic resistance, and risk of asthma, allergic, and autoimmune disease [17,33,36,54-60]. Thus, variation in social and environmental determinants of health that is associated with race and ethnicity may not only shape microbiome variation and impact health but also contribute to health disparities [6,7,20,25]. The tempo and types of factors contributing significantly to race- and ethnicity-associated gut microbiome variation are a priority for research.
Previous studies have identified race- and ethnicity-associated variation in the gut microbiome of children [27][61][62][63][64], though they did not pinpoint when in development variation appears, and the association is not consistent across studies [36,41][65][66][67][68][69][70][71][72][73]. In particular, previous work demonstrated that sociodemographic factors related to rates of exposure to stress, access to grocery stores and healthcare, and environmental exposure risk are correlated with race-associated variation in the gut microbiome and that the effect of some of these factors, such as household income, is stronger in infants compared with neonates [27]. Due to the limitations of available metadata for all studies, we were not able to include all factors known to be important in our analysis, such as antibiotic exposure [10,27,74,75] 70], and various measures of maternal health during pregnancy [9,27,54,63,66,72][77][78][79]. Many of the studies did not measure potentially important factors that are associated with race and ethnicity, including SES, discrimination or stress, and detrimental environmental exposures. Factors that are known to impact gut microbiome composition and were included in our models (age, sex, delivery route, and infant diet) were not independent of race and/or ethnicity (S10 and S11 Figs and S5 Table). While our study included relatively high proportions of non-Hispanic Black and Hispanic White children, our inferences were limited by low numbers of Asian American/Pacific Islander children. The datasets used in the current study did not have a sufficient number of Middle Eastern, Native American, and Alaskan Native children to include those individuals in the analysis.
Self-identified race and ethnicity are complex concepts and have limitations. Self-identification varies over time, may not be reflected by predetermined categories used in surveys, and may not capture all aspects of race and ethnicity [80][81][82]. An additional limitation is that the majority of included studies were conducted in urban areas in distinct geographic locations. The data may not be representative of children from rural areas or the entirety of the US. The results of our study are also not generalizable to other countries due to cultural variation in definitions of racial and ethnic categories. These limitations highlight the necessity of future efforts to recruit a far greater diversity of participants for understanding human microbiome diversity [1][2][3].
During the first 3 months of age, typically high inter- and intraindividual variability in the infant gut microbiome may contribute to the effect of race and ethnicity, in addition to other maternal, environmental, and social factors that associate with the gut microbiome during this developmental period [35,83,84]. Additionally, the rapid development and marked variation in abundance of microbial taxa within and between individuals continues for at least the first year of life [34,85,86]. Differences in social exposures through childcare, dietary variation due to differential rates of breastfeeding and methods of starting solid food, and environmental exposures through time spent in green spaces may be especially impactful starting at 3 months of age and continuing throughout the first year [9][10][11][12][13][14][15][16][17][18][19][87,88]. Many studies of early life and external factor associations with gut microbiome variation have had limited power to detect the effects of multiple factors, finding few or inconsistent relationships between early life determinants and gut microbiome diversity and composition [10,17,76]. Our findings underscore the need for well-powered, longitudinal studies of diverse cohorts that comprehensively assess all internal and external factors known to affect the developmental trajectory of the microbiome [5][6][7][25][89][90][91][92]. Other studies have found that the development of the gut microbiome appears to be particularly sensitive to environmental factors and early life events during the first 3 years of life [14,34,93,94]. Additional work is now needed to assess if social and environmental determinants of health begin to influence variation in the microbiome at or near 3 months of age in a way that is potentially important for understanding health disparities in adults, providing a relatively narrow window of time in which to identify potentially impactful factors.

Materials and methods
Eight datasets with 16S rRNA sequencing data and available race and ethnicity metadata were used in this study [27,66,67,70,72,95,96] (S21 Fig and S1 Table). Individuals between birth and 12 years of age, living in the US, with a caregiver-reported race of Black, White, or Asian/Pacific Islander, and with a caregiver-reported ethnicity of Hispanic or non-Hispanic were included in the analysis. Individuals were not selected based on a known disease phenotype (e.g., type 1 diabetes). Study was included in all models as strata to control for the effects of different study parameters, and individual identity was included as a factor in all models to assess the impact of individual differences on microbiome communities. While sequencing method, primer choice, and sequencing depth did have a significant association with microbial community composition when included in models, including study as strata removed the effect of these study-specific parameters (S2 Table). As some of the included studies had multiple participants from the same family, we also tested if individual identity or family had a larger effect size. In all cases, individual identity explained a larger proportion of the variation than family (S2 Table).
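The role of study as strata can be illustrated with a minimal sketch of a permutation test in which group labels are shuffled only among samples from the same study, so that between-study differences cannot masquerade as a group effect. This is a deliberately simplified, single-factor version with hypothetical function names and toy data; it is not the multi-factor model fitted in this study.

```python
import random

def pseudo_f(dist, groups):
    """One-factor PERMANOVA pseudo-F computed from a square distance matrix."""
    n = len(groups)
    levels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in levels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)
    a = len(levels)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))

def permute_within_strata(groups, strata, rng):
    """Shuffle group labels only among samples that share a stratum (study)."""
    permuted = list(groups)
    for s in sorted(set(strata)):
        idx = [i for i, x in enumerate(strata) if x == s]
        shuffled = [groups[i] for i in idx]
        rng.shuffle(shuffled)
        for i, g in zip(idx, shuffled):
            permuted[i] = g
    return permuted

def stratified_permanova(dist, groups, strata, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value) with restricted permutations."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    hits = sum(pseudo_f(dist, permute_within_strata(groups, strata, rng)) >= observed
               for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Toy 1-D "communities": two studies (A, B), two groups (x, y) in each study.
points = [0.0, 0.1, 5.0, 5.1, 0.2, 0.3, 5.2, 5.3]
groups = ["x", "x", "y", "y", "x", "x", "y", "y"]
strata = ["A"] * 4 + ["B"] * 4
dist = [[abs(p - q) for q in points] for p in points]
print(stratified_permanova(dist, groups, strata, n_perm=199, seed=1))
```

Because every permuted dataset keeps each study's group composition intact, the null distribution reflects only within-study relabeling.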
Sequence analyses were carried out in QIIME2 (v.2021.4) [97]. Each study was individually imported into QIIME2, and the DADA2 algorithm was used to denoise each study separately to allow us to use appropriate trimming and truncation parameters for each dataset. Feature tables and representative sequences from all studies were then merged using the fragment insertion method [98] to control for differences in amplification and sequencing methodologies between studies. The merged table was filtered to remove sequences absent from the insertion tree. Taxonomy was assigned using a naive Bayes classifier trained on the Greengenes 13_8 99% OTU full-length 16S rRNA gene sequence database. Mitochondria and chloroplast sequences were filtered from the merged feature table prior to downstream analysis.
Alpha and beta diversity indices were calculated in QIIME2 and exported for statistical analysis in R [99]. Linear mixed effects models as implemented in the lme4 package [100] were used to detect significant associations of race, ethnicity, age, sex, delivery route, and infant diet with multiple measures of within-sample diversity (Faith's PD, observed ASVs, Chao 1, Shannon diversity, and Pielou's evenness). Study and individual identity were included as random effects in all linear models to control for the effects of different study parameters and repeatedly sampling individuals. PERMANOVA, as implemented in the vegan package [101], was used to examine associations of race, ethnicity, age, sex, delivery route, and infant diet with unweighted and weighted UniFrac distances (example model: WeightedUniFrac ~ Race + Ethnicity + Age + Sex + Delivery route + Infant diet + SubjectID, strata = Study). Study was included as the strata in the PERMANOVA models to constrain permutations within each study and control for study-specific methodological differences in sample collection and processing. For both the alpha and beta diversity analyses, we additionally examined the effect of sequencing technology, primer set, and sequencing depth (S2 and S4 Tables and S7-S9 and S21 Figs). Analysis of composition of microbiomes was used to identify differentially abundant phyla, families, genera, and species across all samples using the ANCOM-BC package [102]. Generalized linear models using a negative binomial distribution were used to detect differentially abundant phyla, families, genera, and species within each age category using the glmmTMB package [103]. Random forest classification was performed using the mikropml package [104] in R.
A total of 100 training/test data splits were used for each model, and 5-fold cross-validation was repeated 100 times for each of the 100 training/test data splits using the default settings of the run_ml() command. Median AUC, precision recall AUC (prAUC), accuracy, sensitivity, and specificity are reported for each model.
Table. Test statistics for differential abundance analyses at the species level. (XLSX)
S10 Table. Genera identified as differentially abundant between self-identified racial categories across studies. (XLSX)
S11 Table. Important